Today’s date is 2019-02-18
Template for using scmap to assign cell types to single cell RNAseq transcriptional data. The goal of this document is to provide an editable template showing a workflow to download datasets from GEO and use these to compare to a personally generated dataset. The package scmap https://doi.org/10.1038/nmeth.4644 or github https://github.com/hemberg-lab/scmap can compare your single cell transcriptomes to previously annotated datsets and assign cell type and cluster information based on this.
First we need to load libraries required for the analysis.
## Creating a generic function for 'toJSON' from package 'jsonlite' in package 'googleVis'
## Loading required package: SummarizedExperiment
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## Loading required package: BiocParallel
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## Loading required package: ggplot2
## Loading required package: cowplot
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loading required package: ggraph
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.4.2 ✔ purrr 0.2.5
## ✔ tidyr 0.8.1 ✔ dplyr 0.7.6
## ✔ readr 1.1.1 ✔ stringr 1.3.1
## ✔ tibble 1.4.2 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks IRanges::collapse()
## ✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::count() masks matrixStats::count()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ tidyr::expand() masks Matrix::expand(), S4Vectors::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ cowplot::ggsave() masks ggplot2::ggsave()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename() masks S4Vectors::rename()
## ✖ purrr::simplify() masks DelayedArray::simplify()
## ✖ dplyr::slice() masks IRanges::slice()
There are two files we need to initially load into our run: the annotated dataset and test dataset. For this run, the annotated dataset is the single cell RNAseq Human Fetal Kidney dataset from Lindstrom (citation) located at GSE102596. The second is a combined dataset of two organoids generated from the same line, different batches and different ages, previously compared in our lineage tracing paper (citation).
getGEOSuppFiles(GEO = "GSE102596") # downloads the file from GEO into the working directory
## size
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 7362560
## isdir
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar FALSE
## mode
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 770
## mtime
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 2019-02-18 14:22:35
## ctime
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 2019-02-18 14:22:35
## atime
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 2019-02-17 21:16:37
## uid
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 31517
## gid
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar 30234
## uname
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar sean.wilson
## grname
## /group/kidn1/Group-Little_MCRI/People/Sean/PhD/R-projects/cool-resources/GSE102596/GSE102596_RAW.tar kidn1.dl
untar("GSE102596/GSE102596_RAW.tar", exdir = "GSE102596/") # untar (unpacks) file
hfk.data <- read.table("GSE102596/GSM2741551_count-table-human16w.tsv.gz", sep = "\t") # converts the expression matrix into a data.table
reporters <- readRDS("/group/kidn1/Group-Little_MCRI/GROUPS/Stem cells and Regeneration/Single Cell/Reporters_18vs25Days/output/seurat/D18vsD25_combined_20clusters_seurat.rds") # loads previously prepared .rds file of organoid data
The annotated datset is downloaded, however it isn’t actually annotated at this stage. As such, we need to cluster and annotate the dataset, and we will use Seurat to do this.
hfk <- CreateSeuratObject(raw.data = hfk.data, project = "Human_Kidney_hfk")
hfk <- NormalizeData(hfk)
hfk <- FindVariableGenes(object = hfk, mean.function = ExpMean, dispersion.function = LogVMR,
x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)
hfk <- ScaleData(hfk, display.progress = F)
hfk <- RunPCA(object = hfk, pc.genes = hfk@var.genes, do.print = F)
#PCAPlot(object = geo, dim.1 = 1, dim.2 = 2)
PCElbowPlot(object = hfk)
hfk <- FindClusters(object = hfk, reduction.type = "pca", dims.use = 1:10,
resolution = seq(0, 2, 0.2), print.output = 0, save.SNN = TRUE)
hfk <- RunTSNE(hfk, reduction.use = "pca", dims.use = 1:10,
do.fast = T)
clustree(hfk)
TSNEPlot(object = hfk, group.by = "res.0.6")
hfk <- SetIdent(hfk, ident.use = hfk@meta.data$res.0.6)
FeaturePlot(hfk, features.plot = c("GATA3", "COL3A1", "NPHS1", "EPCAM", "SIX1", "PAX2"), cols.use = c("grey", "red"), nCol = 3)
markers <- FindAllMarkers(hfk, min.pct = 0.1, logfc.threshold = 0.5)
markers.list <- lapply(0:(length(unique(hfk@ident))-1), function(x) {
markers %>%
dplyr::filter(cluster == x, p_val_adj < 0.05, avg_logFC > 0) %>%
dplyr::arrange(-avg_logFC) %>%
select(Gene = gene, LogFC = avg_logFC, pVal = p_val_adj)
})
Once this is completed, and you are happy with the clustering and annotation of these clusters, the following chunk will rename each cluster and apply that identifier to each cell.
current.cluster.ids <- c(0,1,2,3,4,5,6,7,8,9,10)
new.cluster.ids <- c("Stroma 1", "Stroma 2", "Neph Progenitors", "Stroma DCN", "Endothelium", "Stroma 3",
"Macrophage", "Coll Duct/Dist Tubule", "Prox Tubule/Gloms", "Cell Cycle", "Blood")
hfk@ident <- plyr::mapvalues(hfk@ident,from = current.cluster.ids, to=new.cluster.ids)
TSNEPlot(hfk, group.by = "ident")
The scmap package requires the data to be in sce format. This is easy to do with the Convert() function
hfk.sce <- Convert(from = hfk, to = "sce")
Select the most informative genes from the input dataset (highlighted in red on the graph). I found that 50-100 features gives a “good” result, i.e. the majority of cells are assigned to their original clusters, there aren’t many unassigned. Can increase or decrease this as required.
rowData(hfk.sce)$feature_symbol <- rownames(hfk.sce)
hfk.sce <- hfk.sce[!duplicated(rownames(hfk.sce)),]
hfk.sce <- selectFeatures.SingleCellExperiment(hfk.sce, suppress_plot = F, n_features = 100)
The scmap-cluster index of a reference dataset is created by finding the median gene expression for each cluster.
hfk.sce <- indexCluster.SingleCellExperiment(hfk.sce, cluster_col = "ident")
head(metadata(hfk.sce)$scmap_cluster_index)
## Stroma DCN Stroma 2 Endothelium Stroma 1 Neph Progenitors
## ACTA2 0.0000000 1.014370 0.000000 0.000000 0.000000
## AKAP12 2.0970781 1.583472 0.000000 0.000000 0.000000
## ANXA2 2.4042071 1.499302 2.623101 1.225246 0.000000
## ATF3 0.4769694 1.504661 1.789264 1.298508 0.000000
## B2M 2.4309264 2.513346 3.810974 2.290613 1.345499
## BTG2 1.1418092 0.000000 0.000000 0.000000 1.146254
## Cell Cycle Macrophage Prox Tubule/Gloms Stroma 3
## ACTA2 0.0000000 0.000000 0.000000 0.6721774
## AKAP12 0.0000000 0.000000 0.000000 1.0782354
## ANXA2 0.0000000 0.000000 0.000000 1.2260957
## ATF3 0.7763372 1.377552 0.000000 1.2009944
## B2M 2.3296700 4.012575 1.285677 2.4696752
## BTG2 0.0000000 1.675700 1.219164 0.7714619
## Coll Duct/Dist Tubule Blood
## ACTA2 0.000000 0
## AKAP12 0.000000 0
## ANXA2 1.170071 0
## ATF3 1.684154 0
## B2M 1.155710 0
## BTG2 2.020063 0
heatmap(as.matrix(metadata(hfk.sce)$scmap_cluster_index))
We now have an index of median gene expression relating to each of the clusters from the input data. Now we can use this index to assign identity to an input dataset. To test this, we can project the index dataset back onto itself, then we can use a different dataset and see where those cells are assigned. The threshold of simmilarity required to assign a cell to a cluster can be adjusted.
scmapCluster_results <- scmapCluster.SingleCellExperiment(projection = hfk.sce,
index_list = list(hfk = metadata(hfk.sce)$scmap_cluster_index),
threshold = 0.5)
# visualise by sankey graphing method
plot(
getSankey(
colData(hfk.sce)$ident,
scmapCluster_results$scmap_cluster_labs[,'hfk'],
plot_height = 800,
plot_width = 1200
)
)
## starting httpd help server ... done
# Plot the similarity value for each cell
qplot(x = scmapCluster_results$scmap_cluster_siml[,'hfk'],
binwidth = 0.01)
## Warning: Removed 53 rows containing non-finite values (stat_bin).
# TSNE plot of reassigned cells by cluster
hfk@meta.data$scmap <- scmapCluster_results$scmap_cluster_labs[,'hfk']
TSNEPlot(hfk, do.return=T, no.legend = F, do.label = T)
TSNEPlot(hfk, do.label =F, group.by = "scmap")
There is a small amount of shuffling of identities doing this analysis, the cluster most affected looks to be the nephron progenitors. This could be because there is a stronger score with a reduced number of features in cells closer to a committed phate instead of their progenitor state. For now, let’s say that there is an acceptable assignment of cell types using our feature based index. Now, we can compare and assign a seperate dataset.
reporters <- SetIdent(reporters, ident.use = reporters@meta.data$res.1.6)
TSNEPlot(reporters, do.label = T)
rep.sce <- Convert(from = reporters, to = "sce")
rowData(rep.sce)$feature_symbol <- rownames(rep.sce)
#rep.sce <- selectFeatures.SingleCellExperiment(rep.sce, n_features = 500, suppress_plot = F)
#rep.sce <- indexCluster.SingleCellExperiment(rep.sce, cluster_col = "res.1.6")
#heatmap(as.matrix(metadata(rep.sce)$scmap_cluster_index))
scmapCluster_results_rep <- scmapCluster.SingleCellExperiment(
projection = rep.sce,
index_list = list(
reporters = metadata(hfk.sce)$scmap_cluster_index
), threshold = 0.2
)
head(scmapCluster_results_rep$scmap_cluster_labs)
## reporters
## [1,] "Cell Cycle"
## [2,] "Prox Tubule/Gloms"
## [3,] "Endothelium"
## [4,] "Coll Duct/Dist Tubule"
## [5,] "Cell Cycle"
## [6,] "Stroma 3"
head(scmapCluster_results_rep$scmap_cluster_siml)
## reporters
## [1,] 0.8130867
## [2,] 0.4228191
## [3,] 0.6564854
## [4,] 0.7617856
## [5,] 0.8151310
## [6,] 0.7605512
plot(
getSankey(
colData(rep.sce)$"res.1.6",
scmapCluster_results_rep$scmap_cluster_labs[,'reporters'],
plot_height = 800,
plot_width = 1200
)
)
qplot(x = scmapCluster_results_rep$scmap_cluster_siml[,'reporters'], binwidth = 0.01)
## Warning: Removed 350 rows containing non-finite values (stat_bin).
reporters@meta.data$scmap <- scmapCluster_results_rep$scmap_cluster_labs[,'reporters']
# Plot scmap assigned cluster identity to tsne plot of data
TSNEPlot(reporters, do.return=T, no.legend = F, do.label = T)
TSNEPlot(reporters, do.label =T, group.by = "scmap")
Interestingly, there are two peaks in the distribution of alignment scores, one around 0.5 and one around 0.8. What cell types contain the lower aligment scores compared to the higher ones? My guess is that the multiple “stromal” populations contain similar scores for their feature index, and will those cells will be less specific to any one cluster.
reporters@meta.data$scores <- round(scmapCluster_results_rep$scmap_cluster_siml[, 'reporters'], digits = 1) # round so as to make graphing easier
TSNEPlot(reporters, group.by = "scores")
The cell cycle cluster on the left and the stromal population left centre give very strong similarity to the corresponding index in the human fetal kidney data. The weakest regions of similarity are where the podocyte cells (top middle) and subset of a stromal cluster (lower middle right) are located. The HFK dataset doesn’t have many podocytes, and there will be differences in stromal pops, so these don’t seem out of place. there is a medium-high similarity in the majority of nephron lineages. I think this has done a decent enough job at assigning cell types based on a limited dataset index (the HFK), and the differences in experiments and cell lines (organoids generated from iPSCs vs real human fetal kidney samples).
This template can serve as a launch point for any analysis using scmap. I’ll continue with our other datasets, also try and use other available datasets as an index, as well as using annotated organoids against eachother. I can also try to “merge” the two datasets together prior using a cca/multicca, subset individual datasets and repeat the process. This might eliminate more batch variation and cause higher correlation scores.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.7 (Final)
##
## Matrix products: default
## BLAS: /usr/local/installed/R/3.5.0/lib64/R/lib/libRblas.so
## LAPACK: /usr/local/installed/R/3.5.0/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 forcats_0.3.0
## [3] stringr_1.3.1 dplyr_0.7.6
## [5] purrr_0.2.5 readr_1.1.1
## [7] tidyr_0.8.1 tibble_1.4.2
## [9] tidyverse_1.2.1 clustree_0.2.2
## [11] ggraph_1.0.2 Seurat_2.3.4
## [13] Matrix_1.2-14 cowplot_0.9.3
## [15] ggplot2_3.0.0 GEOquery_2.48.0
## [17] SingleCellExperiment_1.2.0 SummarizedExperiment_1.10.1
## [19] DelayedArray_0.6.5 BiocParallel_1.14.2
## [21] matrixStats_0.54.0 Biobase_2.40.0
## [23] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
## [25] IRanges_2.14.11 S4Vectors_0.18.3
## [27] BiocGenerics_0.26.0 scmap_1.4.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.1.0 snow_0.4-2 backports_1.1.2
## [4] Hmisc_4.1-1 plyr_1.8.4 igraph_1.2.2
## [7] lazyeval_0.2.1 splines_3.5.0 digest_0.6.16
## [10] foreach_1.4.4 htmltools_0.3.6 viridis_0.5.1
## [13] lars_1.2 gdata_2.18.0 magrittr_1.5
## [16] checkmate_1.8.5 cluster_2.0.7-1 mixtools_1.1.0
## [19] ROCR_1.0-7 limma_3.36.2 modelr_0.1.2
## [22] R.utils_2.7.0 colorspace_1.3-2 rvest_0.3.2
## [25] ggrepel_0.8.0 haven_1.1.2 crayon_1.3.4
## [28] RCurl_1.95-4.11 jsonlite_1.5 bindr_0.1.1
## [31] survival_2.42-6 zoo_1.8-3 iterators_1.0.10
## [34] ape_5.1 glue_1.3.0 gtable_0.2.0
## [37] zlibbioc_1.26.0 XVector_0.20.0 kernlab_0.9-27
## [40] prabclus_2.2-6 DEoptimR_1.0-8 scales_1.0.0
## [43] mvtnorm_1.0-8 bibtex_0.4.2 Rcpp_0.12.18
## [46] metap_1.0 dtw_1.20-1 viridisLite_0.3.0
## [49] htmlTable_1.12 units_0.6-0 reticulate_1.10
## [52] foreign_0.8-71 bit_1.1-14 proxy_0.4-22
## [55] mclust_5.4.1 SDMTools_1.1-221 Formula_1.2-3
## [58] tsne_0.1-3 htmlwidgets_1.2 httr_1.3.1
## [61] gplots_3.0.1 RColorBrewer_1.1-2 fpc_2.1-11.1
## [64] acepack_1.4.1 modeltools_0.2-22 ica_1.0-2
## [67] pkgconfig_2.0.2 R.methodsS3_1.7.1 flexmix_2.3-14
## [70] nnet_7.3-12 labeling_0.3 tidyselect_0.2.4
## [73] rlang_0.2.2 reshape2_1.4.3 cellranger_1.1.0
## [76] munsell_0.5.0 tools_3.5.0 cli_1.0.0
## [79] broom_0.5.0 ggridges_0.5.0 evaluate_0.11
## [82] yaml_2.2.0 knitr_1.20 bit64_0.9-7
## [85] tidygraph_1.1.1 fitdistrplus_1.0-9 robustbase_0.93-2
## [88] caTools_1.17.1.1 randomForest_4.6-14 RANN_2.6
## [91] pbapply_1.3-4 nlme_3.1-137 R.oo_1.22.0
## [94] xml2_1.2.0 hdf5r_1.0.1 compiler_3.5.0
## [97] rstudioapi_0.7 curl_3.2 png_0.1-7
## [100] e1071_1.7-0 tweenr_0.1.5 stringi_1.2.4
## [103] lattice_0.20-35 trimcluster_0.1-2.1 googleVis_0.6.3
## [106] pillar_1.3.0 Rdpack_0.9-0 lmtest_0.9-36
## [109] data.table_1.11.4 bitops_1.0-6 irlba_2.3.2
## [112] gbRd_0.4-11 R6_2.2.2 latticeExtra_0.6-28
## [115] KernSmooth_2.23-15 gridExtra_2.3 codetools_0.2-15
## [118] MASS_7.3-50 gtools_3.8.1 assertthat_0.2.0
## [121] rprojroot_1.3-2 withr_2.1.2 GenomeInfoDbData_1.1.0
## [124] diptest_0.75-7 doSNOW_1.0.16 hms_0.4.2
## [127] grid_3.5.0 rpart_4.1-13 class_7.3-14
## [130] rmarkdown_1.10.2 segmented_0.5-3.0 Rtsne_0.13
## [133] ggforce_0.1.3 lubridate_1.7.4 base64enc_0.1-3